Using TF and OpenCV to differentiate between leukocytes

This project aims to provide a model that will be able to identify different types of leukocytes in an image of a blood sample. It is presented as a simple case study.

Although all leukocytes participate in protecting our bodies from pathogens in one way or another, the exact function of a leukocyte depends on its type. It is important to be able to distinguish between these types if we want to study immune responses. In this notebook I devise a model that could be used to analyse samples quickly and accurately.

You can find the final model in the last section of this notebook.

Dataset overview and cleaning

I got the dataset from Kaggle. Main characteristics:

Note that there is an alternative dataset on Kaggle with augmented data and some additional annotations, but we won't use it, so as to simulate more typical research conditions.

First, let's load the dataset and look at the number of each leukocyte type.

As you can see above, the dataset is very imbalanced. It also seems that some images contain more than one leukocyte. Let's check that by looking at some of the images.

There are only a few images that contain more than one cell; for the sake of simplicity, I removed them from the dataset.

Preparation of training, validation, and testing datasets

Now that I know a little bit about the data I'll be working with, I can build the training, validation, and testing datasets. I'll use 25% of the images for testing, and 80% of the rest for training.
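A minimal sketch of such a split using scikit-learn (the helper name and the `random_state` values are my own choices, not the notebook's actual code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(images, labels):
    # Hold out 25% of the images for testing, then split the rest
    # 80/20 into training and validation sets.
    x_rest, x_test, y_rest, y_test = train_test_split(
        images, labels, test_size=0.25, random_state=0)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, train_size=0.8, random_state=0)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```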

Model building, fitting, and evaluation

All of the models in this notebook are based on ResNet, although much smaller, because the dataset is simpler and smaller.

The model contains a configurable number of blocks, each made of two convolutional layers with 3 x 3 kernels followed by a batch normalisation layer. For added regularization I apply a 20% dropout between the two convolutional layers. The blocks are connected sequentially, with residual connections between their connection points.
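A block of this kind could be sketched in Keras as follows. The layer ordering and the 1 x 1 projection on the shortcut (needed when the channel counts differ) are assumptions based on the description above, not the notebook's exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels):
    # Two 3x3 conv layers with 20% dropout between them, then batch
    # normalisation, plus a residual connection around the block.
    shortcut = x
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = layers.Dropout(0.2)(y)
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != channels:
        # Project the shortcut so the residual addition is well-defined.
        shortcut = layers.Conv2D(channels, 1, padding="same")(shortcut)
    return layers.Add()([y, shortcut])
```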

The important detail in the compile function is that I apply exponential decay to the learning rate. I use the Adam optimizer, which seems to be the industry standard.
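A minimal sketch of that setup; the initial rate and decay values here are assumptions, not the notebook's actual hyperparameters:

```python
import tensorflow as tf

# Learning rate decays by a factor of 0.9 every 1000 steps
# (assumed values for illustration).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```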

I use a custom fit function to supply callbacks for the training phase:

I also shuffle the training set before the start of the training to ensure the batches are (pseudo)random.
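Put together, a fit helper along these lines might look like the following; `fit_model` is a hypothetical name, and early stopping stands in as an example callback:

```python
import numpy as np
import tensorflow as tf

def fit_model(model, x_train, y_train, epochs=30):
    # Shuffle the training set once up front so the batches are
    # (pseudo)random.
    order = np.random.permutation(len(x_train))
    x_train, y_train = x_train[order], y_train[order]
    # Example callback: stop when the validation loss plateaus.
    callbacks = [tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)]
    return model.fit(x_train, y_train, validation_split=0.2,
                     epochs=epochs, callbacks=callbacks, verbose=0)
```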

Finally, I use a custom evaluation function to quickly visualize the main insights about the training of the model and its performance on the test set. I use Cohen's kappa as the performance metric. Compared to accuracy and the F1-score, Cohen's kappa statistic is a better fit for problems with imbalanced label distributions such as this one. It can be interpreted as the magnitude of agreement between reality (the gold labels) and the model (the predicted labels) that can't be attributed to pure chance.
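As a quick illustration of why kappa suits imbalanced data (with toy labels I made up for this example): a majority-class predictor can reach high accuracy while its kappa stays at zero, because all of its agreement is attributable to chance.

```python
from sklearn.metrics import cohen_kappa_score

# 80% of the samples belong to class 0; always predicting class 0
# yields 80% accuracy but zero kappa.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
kappa = cohen_kappa_score(y_true, y_pred)
```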

First model, baseline

First, I scale the images from 640x480 to 256x192 to make the training faster. The model won't care about the dimensions of the input because it's made from CNNs.

The baseline model has two blocks, the first with 16 channels and the second with 32.

Takeaways: The model clearly suffers from the dataset being imbalanced. Almost every cell is labelled as a neutrophil.

Next step: It might be useful to resample the data to balance the number of different leukocytes.

Second model, balanced dataset

There are many methods of resampling data. In some areas, SMOTE, a smart resampling method that generates new observations by sampling the feature space near existing ones, has been gaining popularity. In our case, simple oversampling of the underrepresented classes should be enough.
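Such an oversampling step could be sketched like this (`oversample` is a hypothetical helper): each class is resampled with replacement up to the size of the largest class.

```python
import numpy as np

def oversample(x, y):
    # Draw, with replacement, as many indices per class as the
    # largest class has, then shuffle the combined index list.
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        np.random.choice(np.where(y == c)[0], size=target, replace=True)
        for c in classes])
    np.random.shuffle(idx)
    return x[idx], y[idx]
```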

Takeaways: Training the same model for the same time on a more balanced dataset has yielded a dramatic improvement in performance (over 730%).

Next step: The model now has to find the cell in the image and then identify its type. Let's speed up the training (and maybe even improve the performance) by isolating the cell in the image manually using OpenCV and feeding only the cells to the model.

Leukocyte isolation by image segmentation with OpenCV

Because I don't have information about the positions of leukocytes in the training samples, I can't train a neural network to perform the image segmentation and isolate the leukocytes. Luckily, the white blood cells in the samples have certain characteristics that make it possible to detect them using OpenCV alone:

  1. The background (i.e. the area that is neither an erythrocyte nor a leukocyte) is gray.
  2. The colour of erythrocytes leans toward red hues.
  3. The colour of leukocytes has a strong blue component.

I will construct simple masks to filter out (1) and (2) to be left only with the leukocytes.

Firstly, a function to detect the background.

Then, a function to detect the erythrocytes. I'll use the fact that their colours have a very weak blue component, or a strong red component.
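A sketch of that idea; the channel thresholds here are assumptions, not the notebook's original values:

```python
import numpy as np

def erythrocyte_mask(image_bgr, blue_low=80, red_high=150):
    # A pixel is treated as an erythrocyte when its blue component is
    # weak OR its red component is strong (assumed thresholds).
    blue = image_bgr[:, :, 0].astype(int)
    red = image_bgr[:, :, 2].astype(int)
    return (blue < blue_low) | (red > red_high)
```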

The leukocyte masks (i.e. not background & not erythrocyte) themselves look like the following.

To finalize the leukocyte masks I will:

  1. Get rid of the little scraps and make the masks more round.
  2. Find the bounding boxes of the leukocytes.
  3. Finally, crop the leukocytes according to their bounding boxes.

I'll do (1) by opening the image, that is, applying erosion followed by dilation.

For (2), it is now trivial to find the contours of the leukocytes, because they are currently the biggest unmasked areas of the image.

It is also easy to find their bounding boxes.

I crop the images and resize each cell to a 128 x 128 square. A little stretching doesn't hurt; on the contrary, it helps the model generalize well.

I will add these images to the original datasets and use them from now on to train the models.

Third model, balanced dataset with isolated cells

Takeaways: Compared with the previous model (see the evaluation below), the model trained on isolated cells has a slightly better performance, converges slightly faster, and its training is much quicker thanks to the images being much smaller.

Next, and final, step: I'm quite happy with the model as it is; I will now add some layers to the model and lengthen the training process to get even better results. To make sure the model won't overfit, I will perform data augmentation.

Data augmentation

Increasing the model size without increasing the number of training samples would result in overfitting. That is why I will augment the dataset with additional observations that are made by modifying the existing ones.

Flipping and rotation.

Shear transformation.

Brightness adjustments.

Contrast adjustments.

And everything put together.
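The combined pipeline could be sketched with Keras preprocessing layers. The parameter ranges are assumptions, and Keras has no built-in shear layer, so shear is omitted from this sketch:

```python
import tensorflow as tf

# Random flips, rotation, brightness, and contrast applied on the fly
# during training (assumed parameter ranges).
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.25),
    tf.keras.layers.RandomBrightness(0.2),
    tf.keras.layers.RandomContrast(0.2),
])
```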

Final model, augmented and balanced dataset of isolated leukocytes

This model is almost ten times as good as the one I started with!

Comparison with other models on Kaggle

Other people on Kaggle mostly use the alternative dataset supplied with this problem, which already contains augmented data. This means that the metrics of my model and other models can't be directly compared.

On the other hand, comparing the number of parameters and the performance of my model with those of other models might still yield some insight.

The model has 134,853 parameters.

As everyone seems to be using accuracy, I calculated it for my model as well to have a comparison point. Here is a comparison with the three most upvoted models that are publicly available on Kaggle.

| Notebook | Accuracy on the test set | # of parameters |
| --- | --- | --- |
| this model | 94% | 134,853 |
| Identify Blood Cell Subtypes From Images | 90% | not provided |
| Blood Cell Keras Inception | 86% (on validation set) | 363,825 |
| Deep Learning From Scratch + Insights | 83% | 16,732 |

Conclusion

Main takeaways:

As is, the model isn't suitable for clinical use. However, it performs very well considering there were fewer than 400 labelled examples.